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Abstract. The accurate prediction and characterization of DNA melting domains by 
computational tools could facilitate a broad range of biological applications. However, 
no algorithm for melting domain prediction has been available until now. The main 
challenges include the difficulty of mathematically mapping a qualitative description 
of DNA melting domains to quantitative statistical mechanics models, as well as the 
absence of 'gold standards' and a need for generality. In this paper, we introduce a new 
approach to identify the twostate regions and melting fork regions along a given DNA 
sequence. Compared with an ad hoc segmentation used in one of our previous studies, 
the new algorithm is based on boundary probability profiles, rather than standard 
melting maps. We demonstrate that a more detailed characterization of the DNA 
melting domain map can be obtained using our new method, and this approach is 
independent of the choice of DNA melting model. We expect this work to drive our 
understanding of DNA melting domains one step further. 
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1. Introduction 

The organisation of DNA melting domains is the main characteristic of DNA melting 
cooperativity. However, a mathematical definition of melting domains and an algorithm 
for locating them in a given sequence does not exist. 

It has been known since the 1970s that DNA melts stepwise, domain after domain, 
in a reproducible series of subtransitions as temperature is increased [Tj. DNA melting 
domains have also been called cooperatively melting regions (CMR) [2j, thermalites PQ, 
cooperative units, and isomelting domains [3] . DNA melting domains are regions of tens 
to hundreds of basepairs, with locations determined by the sequence. The melting of 
each domain is approximately a twostate transition. In the strict sense, twostate means 
that only two states ("11111" and "00000") are involved, no intermediate states are 
populated, all basepairs thus open and close in unison, and the melting domains are the 
smallest units of melting. Short DNAs and oligonucleotides usually consist of only one 
domain. The stability and melting temperature of a domain not only depends on its 
internal sequence composition, but also on the stabilities of adjacent domains. 

A distinction must be made between domains and bubbles: The melting domains 
indicate potential melting events, such as the creation, growth or merging of bubbles. 
Melting domains are regions with fixed boundaries, while helix-coil boundaries fluctuate 
and depend on temperature. Various approaches to computing bubbles already exist 

m mm- 

DNA melting can be predicted by various statistical mechanics models and 
algorithms El |9]. The formulation of these models involves microscopic interactions 
and variables at the basepair level, but the concept of melting domains is absent in these 
formulations. There is no explicit representation of melting domains in the input or the 
output of these algoritms. Neither are melting domains simply due to certain input 
parameters, such as a bubble nucleation barrier, but are rather emergent properties 
that depend on all the parameters and the whole sequence. The concept of melting 
domains comes into play at a postprocessing stage and in the interpretation of the 
output. For example, plots of melting curves and melting maps usually contain some 
characteristic features, such as peaks, steps or plateaux, that may be associated with 
melting domains. In some studies, this association is only qualitative, because a precise 
definition of melting domains was not made. However, in the analysis and decomposition 
of optical melting curves, a quantitative relationship between peaks and domains is 
assumed [TOj E] . 

The concept of DNA melting domains rests on work by Azbel in the late 1970s 
[T2l [13] . He described a method for determining the groundstate of DNA as a function 
of temperature. His analysis showed that a sequence is partitioned into groundstate 
domains. Each groundstate domain has a temperature at which the groundstate changes 
from helix to coil. These temperatures define the order in which the groundstate changes 
with increasing temperature. He proposed the groundstate versus temperature scenario 
as a simplified model of the DNA melting process to provide quantitative predictions. 
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Figure 1. The ad hoc segmentation method that was applied to the human genomic 
melting map |15j . The boxes indicate the up, down and flat regions that arc determined 
by the local slope of the melting map. 



The predictive power of his groundstate model soon turned out to be limited, with the 
conclusion that all microstates must be included in the partition function [T?l [2]. In 
spite of that, the groundstate model seems to have strongly influenced, if not dictated, 
the concept of DNA melting domains. In this study, we consider melting domains and 
groundstate domains to be different concepts. 

In a previous study, we approached the problem of computing melting domains. 
In a bioinformatical analysis of the human genomic melting map, we found it useful to 
extract qualitative features of the curve, such as flat plateaux and steep slopes [To]. We 
devised an ad hoc method to do such a segmentation of the melting map, illustrated 
in figure [TJ with threshold values that were chosen intuitively. Using this method, we 
identified melting domains as flat segments, based on the assumption that consecutive 
basepairs melt in unison if they have equal melting temperatures. However, this is not 
necessarily true. We only know that the reverse implication is true: basepairs share the 
same melting temperature if they melt together. In this work, we seek a more proper 
way of doing it. 

We consider "twostateness" to be the defining property of melting domains. 
Previous tradition has claimed that twostateness can be detected at the calorimetric 
level and by considering the widths of transitions. But counter examples have shown 
that those approaches were error-prone [161 H7| . Instead, we consider the preferred 
locations of melting forks. We describe a segmentation method based on boundary 
probability profiles. Already in his 1974 paper, Poland discussed the probabilities that 
a 01 or 10 boundary exists at a specific unit [IS]. This seems to have had little impact, 
however. Most studies have used the standard probability profiles — even when melting 
forks was the main interest [19j [20]. The present method produces segmentations in 
which melting domains do not cover the whole sequence, the result is not simply a 
partitioning. Instead, we find that there are twostate regions interspersed with melting 
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fork regions. 

Many cellular functions require local single-strandedness of DNA, for example, DNA 
replication and gene regulation. A number of DNA-associated proteins contribute to 
the information capacity of the four-base DNA code [2TI [22] . Thus, for instance nuclear 
matrix proteins and nucleosomes are part of the chromatin level of information, and 
participate to modulate trancription factor activity. Ultimately, the DNA sequence 
itself is insufficient to understand the dynamic interplay of the molecules, as indicated 
by the performance of prediction algorithms for the binding of these types of molecules 
along the genomic DNA. These algorithms are presently mostly based on the sequence 
level of information. In order to advance further in the understanding of these aspects, 
it appears necessary to incorporate the physical properties of DNA in a comprehensive 
way, and we here attempt to precisely define the extent of the genomic features of DNA 
melting. 



2. Methods 



There are a number of alternative DNA melting models being used by various research 
groups. The present work aims at generality: The concepts and methods should be 
applicable to each of the DNA melting models. Our approach is formulated solely in 
terms of equilibrium probabilities that should be computable with any model, without 
reference to quantities or properties pertaining to one model only. In some DNA 
melting models, the state of a basepair is a binary variable (0 or 1), where is the 
melted/coil/open state and 1 is the bound/helix/closed state, while in other DNA 
melting models, a state can be a distance and/or angle. In the latter models, however, 
a criterion for distinguishing single-stranded (ss) and double-stranded (ds) is sometimes 
applied. 

We assume that the bases in the sequence are numbered 1, . . . , N and that each 
conformation (microstate) may be represented as a chain of and 1 values. At each 
base position x, we define the probability of a or 1: 

Po(x) = P(...X0X...) (la) 

/',(.'') /'(...XIX...). (16) 

In these equations, 1 is a bound basepair (ds), is a melted basepair (ss), X indicates 
either or 1, and the sequence position x is indicated. At each specific temperature, we 
can calculate a probability profile, which is the pi(x)-values for 1 < x < N. A melting 
map can be interpolated from several probability profiles and indicates at each position 
the temperature at which p\{x) = |. Probability profiles and melting maps have been 
standard tools in biological applications. The present approach employs another set of 
probabilities. Each segment of two nearest neighbour base positions [x — l,x] has four 
possible conformations. Accordingly, we consider the probabilities 

poo(x) = P(...XOOX...) (2a) 
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Figure 2. The (a) probability profile (in black), (b) 01 boundary probability profile (in 
red), and (c) 10 boundary probability profile (in green), of the human mitochondrion 
sequence calculated at T — 83.678°C and [Na + ] = 0.075 M are viewed in the window 
9000 bp-12000 bp as an example. (Colour mnemonic: grcen=bubble start, red=bubble 
stop.) 



poi(x) = P(...X0lX...) (2b) 

Pio(aO = P(...XlOX...) (2c) 
p u ( x ) = P(...X1 1 X...) (2d) 

for 2 < x < N. poi(x) and pio(x) are called boundary probabilities. We assume that 
they can be calculated to obtain boundary probability profiles at each temperature. 
Figure [2] shows an example of 01 and 10 boundary probability profiles compared with 
the probability profile. By basic probability laws, we have: 

p (x) +pi(x) = 1 (3) 
Poo(x) + Poi(x) +Pio(x) +Pn(x) = 1 (4) 



and 
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Here we have introduced the conditional probabilities, for example, the probability pi\o(x) 
that there is a 1 at position x given that there is a at position x — 1. 

2.1. Twostate regions 

In this section, we propose a formal definition of twostateness. Usually, the adjective 
twostate indicates a property of transitions, but we will use it to indicate a property 
of sequence regions. Literally, twostate means that only two possible conformations of 
a region [x,y], the all l's and all O's, can exist at any instant and at any temperature. 
This "all-or-none" situation may be expressed as 

P(...x|_^jx...) + P(...X0---0X...) = l, (6) 

all l's all O's 

or, equivalently, that there is zero probability at all temperatures of the region containing 
any ds/ss boundaries. However, regions that are strictly twostate do not exist. We 
expect that there is always a nonzero probability of partially melted intermediates. 
A more realistic definition must incorporate a small probability of thermally induced 
"defects" in a twostate region. We achieve this by means of a threshold e > 0: A region 
[x, y] is called globally twostate if 

P(. . . xh^lx. . . ) + P(. . . X0 • • • OX. . . ) > 1 - e (7) 

all l's all O's 

at all temperatures. The term globally refers to the above two probabilities, in which 
the conformation of the whole region [x, y] is specified. Such probabilities may be 
computationally challenging. We expect that computing globally twostate regions could 
involve high algorithmic complexities. For this practical reason, we will instead study 
regions that are locally twostate. Equation §7§ applied locally to a nearest neighbour 
segment at position x can be written 

Pn(x) +p 00 (x) > 1 - e (8) 

or, equivalently, 

Poi(x) +p w (x) < e. (9) 

However, in the following definition we will require each of poi(x) and pio(x), rather 
than their sum, to be smaller than e. A region [x sta rt, ^end] is locally twostate if for each 

X £ [^start "I - 1 j -^end] ■ 

Poi(x) < e (10a) 
Pw{x) < e {10b) 

at all temperatures. A region being globally twostate implies that it is locally twostate, 
but not vice versa. However, if region [x, y] is locally twostate, then (by Boole's 
inequality): 

P(. . . Xl_^Jx. . . ) + P(. . . X (^JJX. . . ) > 1 - 2ne (11) 

all l's all O's 
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at all temperatures, where n = x en d — ^start, that is, the region is globally twostate with 
a less strict threshold value. 

2.2. Boundary probability segmentation 

So far, we have proposed global and local variants of twostate regions, but we have not 
indicated how to determine the start and end positions of such regions. For globally 
twostate regions, we leave this as an open question. For locally twostate regions, the 
starts and ends follow naturally when evaluating at each position from 2 to N whether 
or not each of the two boundary probabilities is small enough at all temperatures. This 
yields four possible cases at each position x: 

(i) VT : Poi(x) < e and VT : pio(x) < e (twostate) 

(ii) 3T : p 01 (x) > e and VT : p w {x) < e (01 fork) 

(iii) VT : Poi(x) < e and 3T : pio(x) > e (10 fork) 

(iv) 3T : Poi(x) > e and 3T : Pio(x) > e (double fork) 

By thus classifying each nearest neighbour segment [x — l,x], we obtain a segmentation 
that divides the sequence into four types of regions. For each region [x star t, x ea d\, one of 
the cases (i)-(iv) is true at all internal positions x G [x sta rt + l,£end], but not true at 
the flanking positions x sta rt and x en d + 1. A case (i) region is locally twostate, hereafter 
referred to as a twostate region. In a case (ii) region, there are large probabilities of 01 
boundaries at some temperatures, while the 10 boundary probabilities always remain 
small. This is called a 01 fork region. Likewise, in a case (iii) region, there are large 
probabilities of 10 boundaries at some temperatures, while the 01 boundary probabilities 
remain small always. This is called a 10 fork region. The latter two types of regions are 
called melting fork regions. The last case, (iv), is a region in which there may be large 
probabilities of both 01 and 10 boundaries. This is called a double fork region. 

Figure [3] illustrates that each point (x, T) in the plane can be classified and given 
a colour according to the comparison of the two boundary probabilities with e. In this 
plot, twostate regions correspond to entirely blue vertical columns, while melting fork 
regions are columns that contain red or green "islands" . Figure [3] also shows how some 
of the islands correspond to the steep slopes on the melting map. 

2.3. Sampling algorithm 

An algorithm for finding twostate regions and melting fork regions must evaluate 
statements of the type "for all temperatures" . An approximate method is to sample only 
a finite, but representative set of temperatures, for example, by scanning a range from 
Ti ow to Thigh with incremental steps AT. We use values Ti ow and Thigh corresponding 
to helicities 9 = 0.999 and 9 = 0.001, so as to cover most of the subtransitions. This 
usually means Thigh — T ow rs 20°C. The choice of step size is a trade-off. The smaller 
the AT, the more temperatures are taken into account and the more accurate is the 
segmentation, but at the cost of longer computation time. Melting forks only exist over 
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Figure 3. At each temperature (in steps of 0.1°C) and at each position x in the 
sequence, the colors red, green, blue indicate the presence of a 01 boundary (poi(x) > e 
and pio(x) < e), 10 boundary (poi(x) < e and pio(x) > e), or no boundary (poi(x) < e 
and pio(x) < e) in the parameterfree case: e = e m i n - Boundaries tend to "live" during 
temperature intervals up to 4°C wide. On top of the colour map is plotted the melting 
map (black curve). Note that the two were calculated independently from each other. 
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Figure 4. Convergence test for the sampling algorithm, using a segmentation based 
on step size of 0.005°C as reference. Different choices of step size are given on the 
x-axis, with the y-axis denoting Hamming distance between resulting segmentations 
and the reference segmentation. 



a temperature interval (figure [3]), so the resolution must be high enough to detect the 
shortest of such temperature intervals. 

Figure [4] shows how the segmentation depends on sampling resolution. A 
segmentation based on a very small step size of 0.005°C is used as reference, and the 
figure shows the Hamming distance between this reference and segmentations based 
on larger step size. The hamming distance measures the number of positions that 
differs with respect to the class (twostate and melting forks) between two compared 
segmentations. 

As can be seen from the figure, the use of step sizes above 1°C alters the resulting 
segmentation substantially, while the segmentation converges for step sizes below 0.1°C 
Based on this analysis we have chosen AT = 0.1°C as standard, since this gives a 
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Figure 5. Stacked histogram showing the total amounts of twostate regions (blue), 
01 fork regions (red), 10 fork regions (green), and double fork regions (magenta) in 
a 16571 bp sequence, plotted over a range of e- values (logarithmic axis). The double 
fork regions disappear at e m j n 0.0006. 



segmentation almost identical to segmentations achieved with smaller step sizes and 
longer computational time. 

2.4- Determining the e-value 

The parameter e is the threshold used for distinguishing large from small boundary 
probabilities. There are two ways of determining its value: (1) as an arbitrary input to 
the algorithm and (2) as a parameter-free output of the algorithm. 

Thermal fluctuations create boundaries anywhere in the sequence, which imposes 
a background noise level in the boundary probability profiles. On the other hand, 
the sequence encodes the preference for certain boundary locations, which gives rise to 
high peaks above the background noise (figure [2]). The parameter- free e-value separates 
the peaks from the noise. Figure [5] shows how the segmentation depends on e. At 
e = 0.1, the boundary probabilities are considered small almost everwhere, there are 
no melting fork regions. At the other extreme, e < 0.00001, the boundary probabilities 
are considered large almost everwhere, there are no twostate regions and most of the 
sequence is considered to be double fork. We interpret the double fork regions as a 
reflection of the background noise. We define the parameterfree e m ; n as the value above 
which the double fork regions disappear. The e min depends on the sequence and can be 
written 

e min = max[max(min{poi(^),Pio(^)})]- (12) 

T x 

To avoid detecting thermal noise, one should use an e > e min . In parameter-free 
segmentations, we do a sampling over a temperature interval [T\ ow , Thigh] to determine 



the e min by (12). Figure 5 shows that the amounts of twostate, 01 fork, and 10 fork 



regions are about a third each at e r 
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2.5. Approximations derived from probability profiles and melting maps 



Figure [2] shows a correspondence between a probability profile and the two boundary 
probability profiles: Steep increases in pi(x) coincide with peaks in p i(x), while steep 
decreases coincide with peaks in p w (x). This suggests that the twostate, 01 fork, and 10 
fork regions can be estimated approximatively from standard probability profiles alone. 
For 2 < x < N we can define the slope of a probability profile as the difference: 

Ap 1 (x)=p 1 (x)-p 1 (x-l). (13) 



By 



we derive 



Pi{x) = Poi(x) +pii(x) 
pi(x - 1) = p w (x) +pu(x) 



(14a) 
(146) 



Api(x) = p 01 (x) - pio(x). (15) 

This equation explains the correspondence in figure [2] between the slope in the 
probability profile and the peaks in the boundary probabilities. It also shows that there 
is a loss of information when two boundary probability profiles are "collapsed" into 
the probability profile. Instead of classifying each nearest neighbour segment [x — l,x] 
according to the cases (i)-(iv) above, we could do the following classification: 

if 3T : Ap^x) > e 



class(x) 




if VT 
if 3T 



\A Pl (x)\ <e 
Api(x) < —e 



(16) 



(Note that this segmentation of the probability profiles into up, down, and flat regions 



is not the same as the segmentation of the melting map.) From (16) and (15) it follows 
that each twostate region is contained inside a flat region. Furthermore, if e > e min 
(no double forks), then each up region is contained inside a 01 fork region and each 



down region is contained inside a 10 fork region. Used as an approximation, (16) would 
misclassify some (parts of) melting fork regions as being twostate. 

We have not analytically solved how these results translate to the melting map 
segmentation [T5], which has the form: 

if AT m 
if \AT m (x) 
if AT m x 
otherwise 



class(x) 



up 

flat 

down 



> ei 
< -ei 



(17) 



none 



where AT m (x) = T m (x) —T m (x — V 
melting map segmentation with e\ - 



Note that this has two thresholds: ei and e%. The 
0.13 and €2 = 0.01 is illustrated in figure [T] 
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Table 1. Twostate regions are divided into nine fork-fork types. Four of these 
correspond to meltmap types of flat regions: top, bottom, upstair, and downstair |15j . 
The subtransition indicates the most likely melting process of each region |10l 123) . 
based on Azbel's distinction according to the change in the number of boundaries 
[T2"] . The effect on T m is the deviation from the expectation based on GC% content 
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2.6. Subtransitions and fork-fork types 

Once a segmentation is produced, it is possible to distinguish different types of twostate 
regions, called fork-fork types, by identifying the nearest flanking or bracketing melting 
fork region on each side. If a twostate region is close to the sequence end (5' or 3'), then 
it is possible that no melting fork regions exist before that end. On each side, therefore, 
there are three possibilities (01, 10, either 5' or 3'), which combines to nine possible 
fork-fork types of twostate region. We write the fork-fork types in the notation 01_10, 
10_10, 5'_10, 01_3', etc. The nine fork-fork types are listed in table [TJ 

In the human genomic melting map, we divided the flat segments into four 
types called top, bottom, upstair, and downstair [T3]: We compared the melting map 
temperature averaged over a flat segment with the melting map temperatures averaged 
over the two flanking regions of equal length. The flat segment is a top if both flanking 
temperatures are lower, it is a bottom if both are higher, it is an upstair if leftside 
is lower and righthand side is higher, and downstair otherwise. Table [T] shows the 
correspondence between the four melting map types and four of the fork-fork types. 
Because of the lengths of the chromosomes, we did not consider the other five types of 
flat segment at the sequence ends. We found that the type of a flat segment (top and 
bottom) has a strong effect on its melting temperature, that is, the stability depends 
on neighbouring domains, not only the internal GC content |15j . 

These findings were in accordance with Azbel's prediction of the deviation of 
a domain's T m from the GC%-based Marmur-Doty prediction [121 OH El UJ- He 
distinguished five types of subtransition of a melting domain by the number of 
boundaries they create: (I) nucleation of a bubble, (II) unzipping from an end, (III) 
internal growth of a bubble, (IV) merging of a bubble and an unzipping end, and (V) 
merging of two bubbles. 
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Table [I] shows that the fork- fork classification corresponds to distinguishing the left 
and right variants (mirror images) of the subtransition types II, III and IV. The 5'_3' 
type corresponds to oligonucleotides that do not contain any melting fork regions and 
melt and dissociate in a single step. The table summarizes the expected relationships 
between the different types of regions and transitions. For example, when a 01_3' region 
melts, it will most likely open to merge with an existing bubble to the left and with the 
3' end to the right. Otherwise, it would create a 10 fork to the left and/or a 01 fork to 
the right, in contradiction with its type being 01_3'. The 01_3' region is destabilized by 
the flanking bubble on the left, lowering its T m . 

3. Results 

The methodology for determining melting domains is in itself the main result of this 
study. Although melting domains have been used as a concept for a long time, this 
paper presents a principled approach to calculation of the domains. With the melting 
segmentation produced by the methodology as a proposed reference, we here analyze 
melting segmentation of biological DNA sequences and compare our proposed reference 
to the previous ad hoc method for the prediction of melting domains. 

We first give a direct visualization of melting domains in a selected portion of 
the mitochondrion, and then provide some statistical results from the full human 
chromosome 21 segmentation. 

Figure [6] shows the segmentation from position 8250 to 8750 of the mitochondrial 

ssliar '■' i ■■ ii'irt — ti*tttti 




8500 8520 8540 8560 8580 8600 8620 

Figure 6. Visualization of twostate and melting fork regions in the UCSC genome 
browser. Position 8250 to 8750 of the mitochondrial genome is shown. Also shown is 
a genomic type track, in this case evolutionary conservation of the genomic sequence. 
Although there is no direct information to be obtained from this small segment of DNA, 
these two features, or other genomic features, may be compared to the information 
provided by the segmentations, in order to uncover hidden dependencies. A zoomed 
in version is also shown for a part of the segmentation, with twostate regions denoted 
in blue, and "01" and "10" forks further specified in red and green respectively. 

genome. Black bars denote twostate regions, while areas between bars denote melting 
fork regions. As can be seen from the figure, the segmentation typically forms several 
tight clusters of twostate regions. Very short melting fork regions are typically occurring 
between twostate regions within each such cluster, while one or a few long melting fork 
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regions are separating the clusters. Also, there are a few very short, isolated twostate 
regions. 

The segmentation is further compared to the melting map segmentation [T5] . 
Flat regions of this segmentation should correspond to twostate regions of the new 
segmentation and are denoted in black. Non-flat regions correspond to melting 
fork regions, with "up" regions corresponding to "01" forks and "down" regions 
corresponding to "10" forks . As can be seen from the figure, the ad hoc procedure 
used in the previous paper gave rise to a much larger amount of flat regions. The ad 
hoc segmentation typically gave very long flat regions, only separated by short non-flat 
regions. 

As twostate (flat) regions covered a much larger part of the genome according to 
the previous segmentation procedure, we also constructed a third segmentation. This 
segmentation was based on the melting map segmentation (17), but with parameter 
values e\ = 0.13 and e 2 = 0.0002 calibrated to give similar global amounts of twostate 
regions and flat segments. This calibrated melting map segmentation resembles the new 
segmentation, although some minor differences are noticeable. For instance, where one 
segmentation assigns a continuous twostate region, the other segmentation sometimes 
assigns multiple clustered regions. 

The statistical properties of the segmentations are further analyzed on human 
chromosome 21. Figure [7] shows a log-log plot of the length distributions of twostate/flat 
regions, for both the new and the previously proposed segmentation method. Most 
regions are short, with relatively few regions being longer than 100 bp. For lengths of 
around 1000, there is on average less than 1 region of each exact length value. 

Figure [8] further shows the detailed distribution of twostate regions of length 20 to 
200 using plain scales. From both figures it is apparent that the twostate regions of the 
new segmentation are generally shorter than for the previously proposed melting map 
segmentation. 

For the full chromosome 21, there are 1.3 million twostate regions according to the 
new segmentation, with an average length of 18. This compares with 0.5 million flat 
(twostate) regions for the previous melting map segmentation, with an average length 
of 55. 



4. Discussion and conclusion 



In this work, we have developed a segmentation of DNA sequences that is based on 
detection of local twostateness. Azbel's groundstate segmentation produced domains 
that partitioned or "tiled" the sequence [T2"] . His approach ignores any excited states. 
In contrast, our approach takes the ensemble into account, with the consequence that 
melting fork regions appear in between the twostate regions. Melting fork regions occupy 
a quite large fraction of the sequence (figure Isj. We found that the average lengths of 
twostate regions are 18 bp in human chromosome 21, while the average length of flat 
regions in the melting map was 55 bp [15]. This is an order of magnitude smaller than 
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Figure 7. Length distribution of twostate regions in human chromosome 21. This 
is shown using logarithmic scales, with segment length between 10 and 1000 on the 
x-axis, and count of segments per exact length value on the y-axis. Smoothed curves 
are shown for both the PLoS meltmap segmentation and the boundary probability 
segmentation. 

previously reported lengths of melting domains [21] . 

These findings may have important implications for how to utilize DNA melting in 
a genomic context, in defining the extent of local DNA singlestrandedness of a genome, 
and how it may influence the informational context upon which other molecules act. 

A biproduct of the methodology is the characterization of the fluctuational noise 
level in terms of the sequence dependent e m ; n . From a few sequences studied so far, we 
have computed values in the range 10 3 -10 4 . Furthermore, we have found this noise level 
to vary less than one order of magnitude over the melting range of temperatures, being 
roughly the same at high and low temperatures. 

Melting fork regions have a preferred direction: either 01 or 10. Although a few 
double fork regions may actually be physically meaningful, we suggest avoiding double 
fork regions by using e > e min , in order not to mistake noise for a signal. 

We have aimed at generality with respect to the statistical mechanical DNA melting 
model. But so far, we have only applied the methodology to the Poland-Scheraga model. 
It remains to be investigated how results would depend on using other models, and if it 
would be computationally viable at all. It could be a way of comparing different DNA 
melting models to compute a segmentation with each. 

The generality criterion imposed some algorithmic restrictions, for example, that 
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Figure 8. Length distribution of twostate regions in human chromosome 21. This is 
shown using plain scales, with segment length between 20 and 200 on the x-axis, and 
count of segments per exact length value on the y-axis. Curves are shown for both the 
PLoS meltmap segmentation and the boundary probability segmentation. 

we could not exploit properties of the Poland-Scheraga partition functions or the nearest 
neighbour stability parameterization. It is possible that a specialized treatment would 
be able to exploit such detailed knowledge to obtain much more efficient segmentation 
algorithms. 

The temperature scanning algorithm is simple and has algorithmic complexity 
O(N), but we found that small stepsizes (AT = 0.1°C) are needed to obtain sufficient 
accuracy. Unfortunately, this means long computation times for human genomic 
sequences. In this work, we have not concentrated further on algorithm optimization. 

As a possible application, the boundary probability segmentation may provide 
annotations of all the twostate regions and melting fork regions in the human genome, as 
we did in Chromosome 21. Regions is a type of annotation that is well accommodated by 
existing genome browsers. For such applications, it would be useful to add an attribute 
to the twostate regions indicating their (average) melting temperature. Compared with 
a standard melting map, such a melting domain map emphasizes more the essential 
features of cooperativity, while reflecting to a lesser degree the variation in GC content. 
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Appendix A. Calculation of boundary probabilities in the Poland-Scheraga 
model 



The two boundary probabilities can be computed from partition functions in the Poland- 
Scheraga model as follows: 

Zxoi(x)Z 01 x(x) 

Poi(x) = - (A.l) 

, \ Z X io(x - l)Z wx (x - 1) 
PioW = ^ • ( A -2) 

Zxoi(x) and Zxio(x — 1) are partition functions of the segment [l,x], Z m x(x) and 
Z\qx{x — 1) are partition functions of the segment [x — 1, N], and Z is the total partition 
function of the whole chain [25]. All these partition functions are computed in our 
previously described algorithm [2E]. In the language of 



PVtotal 



PVtotal 

Alternatively, one can use the Poland-Fixman-Freire algorithm [IS] [27]. This 
algorithm is highly optimized for the purpose of computing p\(x) profiles. The PFF 
algorithm first computes all the conditional probabilities Pi\i(x) (forward sweep) and 
then all the p\{x) (backward sweep). From these two arrays, we readily obtain the 
boundary probabilities. By (5c) we derive 

Piq(x) =pi(x- 1)pq\i(x) (A.5) 
= pi(x - 1) -pi(x - l)p 1 | 1 (x). (A. 6) 



By (15) and (A. 6) we derive 



Poi(x) = pi(ar) - px{x - l)pi|i(x). (A. 7) 
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